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A  DIGIT  PIPELINED  DYNAMIC  TIME  WARP  PROCESSOR 


Mary  Jane  Irwin 

Department  of  Computer  Science 
The  Pennsylvania  State  University 
University  Park,  PA  16801 


Abstract 

A  custom  CMOS  VLSI  processor  is  presented  which  can  achieve  real-time  isolated 
word  recognition  for  dictionaries  of  up  to  2000  words.  The  processor  is  based  on  the 
dynamic  time  warping  (DTW)  algorithm,  an  exhaustive  search  technique  which  permits 
nonlinear  pattern  matching  between  an  unknown  utterance  and  a  reference  word.  Our 
design  differs  from  previous  DTW  designs  in  that  (l)  all  data  is  represented  in  signed¬ 
digit,  base  4  format;  (2)  digits  are  passed  between  processing  elements  in  a  most 
significant  digit  first,  digit  serial  fashion;  and  (3)  the  algorithms  are  pipelined  at  the  di¬ 
git  level.  Because  of  these  features,  nine  processing  elements  will  fit  on  one  chip  using  3 
micron  feature  size  devices  and  an  84  pin  package.  The  VLSI  DTW  processor  presented 
is  both  flexible  and  modular.  The  design  is  independent  of  the  number  of  coefficients 
per  frame  and  the  precision  of  those  coefficients.  The  design  is  also  easily  expandable  in 
the  number  of  frames  per  word  and  the  warp  factor  used  to  achieve  the  nonlinear 
matching. 
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Isolated  word  recognition,  dynamic  time  warping,  digit  pipelining,  signed-digit  represen¬ 
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Introduction 


A  key  component  of  any  speech  recognition  system  is  the  hardware  which 
matches  an  unknown  utterance  in  a  speech  sample  against  a  dictionary  of  words.  The 
input  speech  waveform  must  first  be  filtered,  correlated,  analyzed,  partitioned  into  dis¬ 
tinct  utterances,  and  time  normalized  before  this  pattern  matching  can  begin.  Isolated 
word  recognition  can  be  performed,  for  example,  by  using  beam  searching  [AnB],  two- 
level  DP  matching  (Sakj,  or  dynamic  time  warping  (DTW)  [e.g.,  Ack,  SaC,  YoJ].  These 
algorithms  differ  in  their  search  techniques  (e.g.  exhaustive  or  best  first)  and  in  their 
scoring  methods  (e.g.,  best  match  or  maximum  likelihood).  However  it  is  accomplished, 
isolated  word  recognition  is  one  of  the  most  time  consuming  of  all  the  tasks  in  serial 
word  recognition  systems.  Of  course,  real-time  recognition  is  a  goal.  A  custom  VLSI 
design  for  a  DTW  processor  is  proposed  which  is  real  time  for  dictionaries  containing  up 
to  2000  words. 

A  block  diagram  of  that  portion  of  the  speech  recognition  system  we  are  con¬ 
cerned  with  in  this  paper,  isolated  word  recognition,  is  shown  in  Figure  1.  The  parame¬ 
ters  characterizing  the  unknown  utterance  have  been  time  normalized  into  a  set  of  I 
frames  ordered  in  time.  A  time  normalization  processor  has  been  used  to  stretch  or 
shrink  the  unknown  utterance  to  this  fixed  length.  In  our  DTW  processor  we  assume 
utterances  have  been  time  normalized  into  42  frames  [YoJ],  although  our  design  is 
modular  and  extendable.  By  varying  the  number  of  VLSI  packages,  a  DTW  processor 
with  any  multiple-of-three  number  of  frames  can  be  assembled.  The  parameters  of  each 
frame  are  the  coefficients  characterizing  the  unknown  utterance  at  that  point  in  time. 
Linear  predictive  coding  [RaS]  can  be  used  to  produce  these  frame  coefficients,  typically 
between  four  and  sixteen  coefficients  per  frame.  Each  of  these  frame  coefficients  is  a  6- 
bit  to  24-bit  positive  (or  unsigned)  integer.  We  assume  16-bit  coefficients  for  our  initial 
timing  estimates.  However,  the  hardware  in  the  DTW  processor  has  been  designed  in 
such  a  way  as  to  be  independent  of  both  the  number  of  coefficients  per  frame  and  the 
number  of  bits  used  to  represent  each  coefficient.  Only  the  recognition  rate  is  affected. 
The  reference  dictionary  stores  the  coefficients  of  the  words  against  which  the  utterance 
is  to  be  matched.  Each  of  these  reference  words  has  also  been  time  normalized  to  I 
frames  of  four  to  sixteen,  6-bit  to  24-bit  positive  integer  coefficients.  We  assume  a  refer¬ 
ence  dictionary  of  up  to  2000  words.  (Webster's  Ninth  New  Collegiate  Dictionary  con¬ 
tains  over  160,000  words.) 
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Figure  1.  Isolated  Word  Recognition 


Our  goal  was  to  design  an  isolated  word  recognition  VLSI  processor  which  could 
achieve  real-time  processing  for  reference  dictionaries  of  up  to  2000  words.  The  input 
sampling  rate  typically  ranges  from  6.25  KHz  to  20  KHz  [YoJ].  After  this  input  speech 
waveform  has  been  processed  and  time  normalized  into  distinct  utterances,  words  are 
supplied  to  the  isolated  word  recognition  processor  (the  DTW  processor  in  Figure  1)  at 
the  average  rate  of  one  utterance  eveiy  half  second  [YoJ].  As  the  unknown  utterance  is 
matched  against  an  entry  in  the  reference  dictionary  a  match  factor  is  generated.  This 
match  factor  consists  of  between  four  and  sixteen,  6-bit  to  24-bit  positive  integers  indi¬ 
cating  how  close  each  of  the  characterizing  coefficients  matched,  "summed”  across  the  I 
frames.  A  match  factor  is  produced  for  each  word  in  the  reference  dictionary.  A  deci¬ 
sion  rule  processor  then  uses  these  match  factors  to  select  the  closest  match.  A  match 
of  an  unknown  utterance  against  the  entire  dictionary  contents  must  be  done  in  half  a 
second  or  less  to  meet  the  real-time  goal. 


Dynamic  Time  Warp  Processor 

Our  isolated  word  recognition  processor  is  based  on  a  DTW  algorithm  [Ack, 
BAW,  KNL,  MaR,  Owe,  QGG,  SaC,  WBA,  YoJ],  an  exhaustive  search  technique  which 
permits  nonlinear  pattern  matching  between  the  unknown  utterance  and  the  words  in 
the  dictionary  by  allowing  for  natural  timing  variations  between  the  unknown  utterance 
and  the  reference  words.  An  illustration  of  a  simple  DTW  processor  for  a  four  frame 
utterance  is  shown  in  Figure  2.  The  horizontal  axis  represents  the  reference  words  out¬ 
put  from  the  dictionary  which  have  been  normalized  into  four  frames  ( R0  through  R3) 
each  representing  a  successive  time  interval.  Associated  with  each  frame  are  the 
coefficients  characterizing  the  word  at  that  point  in  time.  The  vertical  axis  represents 
the  unknown  utterance  which  has  been  time  normalized  into  four  frames  ( U0  through 
U 3)  of  coefficients.  This  simple  DTW  processor  contains  a  grid  of  sixteen  processing  ele- 


meats  numbered  PE(0,0)  through  PE(3,3).  Each  PE  is  connected  to  six  of  its  nearest 
neighbors. 

Match  Factor 
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Figure  2.  A  Simple  DTW  Processor  (I  =  4) 

A  single  PE  from  the  simple  DTW  processor  of  Figure  2  is  shown  in  Figure  3. 
Each  PE  must  be  able  to  compute 

Si  j  —  Di  j  +  minimum^,  _u  ,  5,- 5, 

where  D,  }  is  the  local  "distance"  between  a  coefficient  of  frame  j  of  the  reference  word 
and  a  coefficient  of  frame  i  of  the  unknown  utterance  and  S,  ;  is  the  accumulated 
minimum  "distance"  along  a  path  of  adjacent  PE’s.  We  use  the  simple  local  distance 
measure  of  the  absolute  value  of  the  difference  of  Rj  and  Ui .  Other  choices  for  com¬ 
puting  the  local  distance  are  to  use  a  squared  Euclidean  distance  measure  [MaR]  or  a 
logarithmically  scaled  distance  measure  (YoJ).  Weighting  factors,  usually  some  multiple 
of  two,  can  also  be  applied  to  the  local  distance  measures  [MaR.  MRR,  RaS).  We  use  a 
weighting  factor  of  one  only.  In  Figures  2  and  3,  the  computed  values  which  are  passed 
between  neighboring  PE’s  are  indicated  by  —  ,  /,  and  j  lines,  while  the  coefficient  which 
are  merely  passed  through  a  PE  to  its  neighbor  are  indicated  by  — ►  and  f  arrows. 
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Figure  3.  One  DTW  Processing  Element 

Dynamic  programming  [Bel]  is  used  to  find  the  minimum  distance  path  from  PE 
(0,0)  to  PE  (I- 1,1-1).  By  imposing  a  slope  constraint  on  this  path,  only  nearest  neighbor 
interconnect  of  PEs  is  necessary  [MaR].  Fortunately,  since  speech  is  essentially  "well 
behaved,”  limiting  the  slope  of  the  path  does  not  adversely  affect  the  recognition  accura¬ 
cy.  (If  weighting  factors  are  applied  to  each  branch  of  the  path,  a  bias  in  the  path 
movement  through  the  mesh  can  also  be  imposed.)  Parallel  computation  proceeds  in  sys¬ 
tolic  waves  from  lower  left  to  upper  right  in  the  grid  of  PE’s  in  Figure  2,  where  all  PE’s 
on  the  same  back  diagonal  simultaneously  execute  in  the  same  wave.  Four  t,o  sixteen 
accumulated  minimum  distances  (one  for  each  coefficient)  which  are  the  sum  of  the 
minimum  local  distances  along  a  path  from  PE  (0,0)  to  PE  (I- 1,1-1)  constitute  the  match 
factor  which  tells  how  close  the  the  match  was  between  the  two  words.  In  this  way,  a 
match  factor  is  produced  for  every  word  in  the  dictionary. 

In  the  DTW  processor  shown  in  Figure  2,  there  may  be  no  need  to  match  R  3 
against  U 0  if  the  time  discontinuity  between  the  reference  word  and  the  unknown  utter¬ 
ance  can  never  be  that  large.  In  this  case,  PE  (3,0)  and  PE  (0,3)  are  unnecessary.  A 
"warping  factor”  of  two  results;  the  reference  frame  coefficients  and  the  unknown  frame 
coefficients  two  before  and  two  after  the  intersecting  PE  may  be  included  in  the  accu¬ 
mulate  minimum  distance.  With  a  warp  factor  of  one,  PE’s  (3,0),  (3,1),  (2,0),  (0,3),  (1,3) 
and  (0,2)  would  be  unnecessary. 

We  assume  a  warp  factor  of  six  [YoJ]  in  our  DTW  processor,  although  once  again 
by  varying  the  amount  of  hardware  any  multiple-of-three  warp  factor  could  be  accom¬ 
modated.  The  entire  DTW  processor  for  an  I  of  42,  a  warp  factor  of  six,  a  variable 
number  of  coefficients  per  frame,  and  a  variable  number  of  bits  per  coefficient  is  depict¬ 
ed  in  Figure  4.  This  DTW  processor  design  uses  504  PEs. 
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Figure  -4.  The  DTW  Processor 

Several  DTW  processors  for  isolated  and  connected  word  recognition  (and  other 
pattern  matching  problems)  have  been  proposed  and  implemented  in  VLSI  [Ack,  BAW, 
KNL,  MaR,  Owe,  QGG,  WBA).  In  the  AT&T  DTW  processor  [Ack,  BAW,  WBA]  as  in 
many  of  the  other  VLSI  DTW  designs,  one  overriding  consideration  was  the  need  to  in¬ 
tegrate  at  least  one,  and  preferably  more  than  one,  PE  onto  a  single  VLSI  chip.  With 
one  PE  per  chip,  a  minimum  of  four  data  ports  are  required  (the  diagonal  interconnect 
can  be  omitted  by  routing  5,  _j ;  _t  through  a  north  or  east  neighboring  PE).  Because  of 
this  large  I/O  requirement  for  a  64  pin  package,  a  bit  serial  communication  scheme  was 
used.  The  16-bit  coefficient  operands  are  streamed  into  the  chip  in  a  bit  serial,  least 
significant  bit  first  fashion.  Although  the  local  distance  could  be  computed  as  the  bits 
are  input  using  a  conventional  serial  adder/subtractor,  to  do  the  minimum  operation  as 
the  bits  are  being  input  requires  the  operands  to  be  delivered  most  significant  bit  first. 
Apparently,  these  two  primitive  operations  have  conflicting  I/O  requirement,  one  requir¬ 
ing  its  operands  least  significant  bit  first  and  one  requiring  them  most  significant  bit 
first.  Of  course,  both  operations  can  be  done  if  the  full  precision  operand  values  are 
available.  So,  in  the  AT&T  DTW  processor  the  operands  are  first  collected  into  16-bit 
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Input  registers,  then  the  distances  are  computed  using  the  full  precision  values.  After 
processing,  the  operands  are  streamed  out  of  the  chip  once  again  in  a  bit  serial,  least 
significant  bit  first  f  shion.  While  this  drastically  reduces  the  interconnect,  it  also  limits 
performance  since  it  takes  16  clock  cycles  to  transfer  one  operand. 

Our  VLSI  DTW  processor  differs  from  previous  designs  in  that  the  data  is  passed 
between  PE’s  as  base  4,  signed-digits  (Atk],  most  significant  digit  first.  Base  4,  serial 
signed-digit  arithmetic  components  [IrOj  are  used  to  compute  the  distance  digits  as  the 
operand  digits  are  input.  For  each  pair  of  operand  digits  input  (except  the  first,  most 
significant  digit)  an  output  digit  is  produced.  As  soon  as  an  output  digit  has  been  gen¬ 
erated  by  one  PE,  it  is  passed  along  to  the  next  PE  as  an  input.  By  pipelining  opera¬ 
tions  at  the  digit  level  [IrO]  in  this  way,  the  speed  limitations  of  serial  communication  is 
obviated.  A  very  high  degree  of  concurrency  and  a  high  data  rate  can  be  maintained, 
while  the  pin  out  requirements  are  kept  low. 

In  the  following  section,  the  functioning  of  the  digit  serial  primitive  arithmetic 
components  required  in  a  DTW  processor  PE  are  described.  A  PE  is  characterized  by 
its  I/O  requirements  and  size  (dimensions  in  X  for  CMOS  custom  layout).  The  full 
DTW  processor  is  then  considered  with  an  appropriate  partitioning  into  VLSI  chips  us¬ 
ing  MOSIS  standard  frame  sizes  [MOS].  Finally  some  speed  estimates  for  the  isolated 
word  recognition  problem  are  given. 


Digit  Serial  Primitive  Arithmetic  Components 

In  our  DTW  processor  the  data  is  transferred  between  PE’s  digit  serially,  where  a 
digit  is  a  base  4  signed-digit.  (In  our  notation  we  use  capital  letters  to  denote  full  preci¬ 
sion  operands  and  small  letters  to  denote  digit  values.)  Computed  values  are  produced 
in  the  maximally  redundant  symmetric  signed-digit  set  for  base  4,  the  digit  set 
{-3,  -2,  -1,  0,  1,  2,  3}.  For  example,  in  this  digit  set  the  numerical  value  6310  can  be 
represented  in  four  digits  as  0333.,,  1133.,,  10134,  or  10014  where  1  =  -1.  Input  values, 
the  dictionary  and  unknown  utterance  coefficients,  are  represented  in  the  conventional 
base  4  digit  set,  the  digit  set  (0,  1,  2,  3}.  Since  this  is  a  subset  of  the  signed-digit  set 
used  in  the  DTW  processor,  no  conversion  on  input  is  required.  The  match  value  is  pro¬ 
duced  in  signed-digit  representation.  Two’s  complement  encoding  is  used  to  encode  the 
digits  into  a  binary  form.  Thus,  three  wires  are  required  to  transmit  a  distance  digit, 
while  only  two  wires  are  required  to  transmit  a  coefficient  digit.  For  example,  10134 
would  be  encoded  in  12  bits  as  001  000  111  01 12.  The  encoding  "100”  is  not  used  and 
thus  is  available  for  signaling  overflow,  NAN  (not  a  number)  conditions,  or  for  use  in 
hardware  error  detection.  All  data  has  fixed  and  constant  word  length  of  p  base  4  di- 


gits;  for  the  speech  recognition  problem  p  =  8  for  16-bit  coefficient,  although  we  will 
use  p  =  4  in  the  numerical  examples  in  this  section  for  simplicity.  Thus,  24-bits  are  re¬ 
quired  to  encode  a  conventional  16-bit  value  into  a  signed-digit  representation.  In 
signed-digit  representation  the  sign  of  the  operand  is  the  sign  of  the  most  significant  di¬ 
git;  i.e.,  a  separate  sign  digit  is  not  required.  Also,  there  is  only  one  representation  for 
the  operand  zero  -  the  digit  string  consisting  of  all  zero. 

The  data  flow  in  the  DTVV  processor  is  digit  serial  with  the  most  significant  digit 
being  transmitted  first.  The  primitive  arithmetic  components  with  this  data  flow 
characteristic -required  to  build  the  DTVV  PE  are  a  digit  serial  add/subtract  component, 
a  digit  serial  minimum  component,  and  a  digit  serial  absolute  value  component. 
Operand  digits  are  input  to  a  primitive  arithmetic  component  one  set  per  clock  cycle. 
The  most  significant  result  digit  is  generated  one  clock  cycle  after  the  most  significant 
digits  of  the  operand(s)  are  input.  After  the  most  significant  result  digit  is  output,  a 
result  digit  is  output  for  each  subsequent  cycle.  This  time  between  when  the  first  digit 
of  data  is  input  and  when  the  first  digit  of  the  output  is  generated  is  the  component  la¬ 
tency.  In  our  design  all  primitive  components  have  a  fixed  latency  of  one.  Further¬ 
more,  the  operations  are  pipelined  at  the  digit  level.  When  a  digit  of  result  is  output 
from  one  primitive  arithmetic  component  it  is  passed  immediately  to  the  next  as  an  in¬ 
put. 

Add/ Subtract  Component 

The  digit  serial  add/subtract  component  is  defined  in  Figure  5.  A  generic 
add/subtract  component  is  discussed  here,  but  in  the  DTVV  processor  separate  add  and 
subtract  components  are  needed.  In  Figure  5,  the  operator  ”o  ”  represents  either  a  sub¬ 
traction  operation  for  computing  the  local  distance  or  an  addition  operation  for  forming 
the  accumulated  minimum  distance.  When  computing  the  local  distance,  xin  —  uin  , 
and  Vin  =  r,„  which  are  both  in  the  conventional  digit  set  {0,  1,  2,  3}.  When  comput¬ 
ing  the  accumulated  minimum  distance,  j,n  is  the  digit  value  computed  by  the  local  dis¬ 
tance  absolute  value  subtraction  operation  and  yIB  is  the  digit  value  computed  by  the 
digit  serial  minimum  operation,  both  of  which  are  in  the  digit  set  {-3,  -2,  -I,  0,  1,  2,  3}. 
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where 


(1)  s;  =  *,«  o  2/in  -  4  Cj 


where  -1  <  c;  <1  is  chosen  so  that  -2  <  s;-  <2 
(“)  ^ouf  =  5;-l  "h  cj 

Figure  5.  MSD  Digit  Serial  Adder/Subtractor 

The  present  interim  sum  digit,  s; ,  which  is  determined  by  the  first  digit  opera¬ 
tion  (1),  is  stored  in  an  internal  latch  to  be  used  in  the  next  cycle.  Then  a  second  digit 
operation  (2)  adds  the  present  interim  cany,  c;  ,  to  the  previous  interim  sum  and  this 
result  is  output.  Most  importantly  this  addition  is  guaranteed  to  be  carry  free  because 
of  the  restricted  digit  sets  allowed  for  the  interim  carry  and  the  interim  sum.  The  la¬ 
tency  of  the  adder/subtractor  primitive  component  is  one;  one  clock  cycle  after  the  most 
significant  digits  of  the  operands  are  input  the  most  significant  digit  of  the  result  is  out¬ 
put. 

Table  1  gives  the  input  digits/interim  digits  mapping  for  the  first  digit  operation, 
(1)  in  Figure  5.  The  local  distance  computation,  uin  -  rin  ,  u^es  only  the  subset  of  the 
table  from  -3  to  3. 
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The  second  digit  addition  mapping  is  straightforward.  The  addition  component  requires 
two  digit  adder  cells  and  a  one  digit  storage  cell  as  shown  in  the  block  diagram  of  Fig¬ 
ure  6.  (The  first  digit  adder  cell  is  decomposed  further  into  a  digit  adder  sum  cell  (Is) 
and  a  digit  adder  carry  cell  (lc).)  Each  of  these  cells  can  be  built  in  four  levels  of  two  in¬ 
put  logic  gates. 


Figure  6.  ADD  Logic  Block  Diagram 


Cells  of  the  CMOS  adder  component  have  been  fabricated  by  MOSIS  using  3  mi¬ 
cron  feature  size,  double  layer  metal  CMOS  mesh  arrays  [Bee].  Preliminary  tests  on 
these  fabricated  chips  lead  us  to  believe  that  a  gate  delay  of  less  than  two  nsec  can  be 
achieved.  The  mesh  array  layout  plot  of  the  adder  primitive  component  is  shown  in 
Figure  7.  It  is  approximately  200X  by  200X  in  size.  Only  one  primitive  adder  com¬ 
ponent  is  needed  regardless  of  the  operand  precision. 
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Figure  7.  Layout  of  ADD 
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Examples  of  msd,  base  -1,  digit  serial  addition  are  given  in  Table  2  for 
X, n  =  1223^  and  Yin  =  12134  followed  immediately  by  Xin  =  12324  (  =  11124)  and 
n  ~  11214  (  =  11134).  The  correct  first  sum  Zout  =  31024  which  is  produced  by  the 
adder  would  incur  a  full  precision  carry  if_performed  on  a  ripple-carry  adder.  The 
second  sum  is  correctly  formed  as  Zout  =  31114  (  =  22314).  An  in  the  example  in¬ 
dicates  that  that  value  is  not  used  in  the  computation.  For  a  sequence  of  additions  (or 
subtractions),  the  msd’s  of  the  operands  for  the  next  addition  can  be  input  at  the  next 
clock  after  the  lsd’s  of  the  operands  for  the  present  addition  have  been  input.  As  long 
as  overflow  is  not  allowed,  the  interim  carry  formed  during  operation  (1)  for  the  next 
addition  is  guaranteed  to  be  zero.  Note  that  this  interim  carry  is  the  one  used  in  opera¬ 
tion  (2)  to  complete  the  present  addition  and  if  it  was  not  zero  an  incorrect  lsd  for  the 
present  addition  would  be  output. 


The  primitive  arithmetic  components  of  the  DTW  processor  have  been  designed 
assuming  that  intermediate  results  will  never  overflow.  If  overflow  does  occur,  the  least 
significant  digit  of  the  previous  intermediate  result  is  contaminated.  This  situation  can 
be  avoided  by  having  the  ADD  component  which  is  used  to  compute  the  accumulated 
distance  measure  "corrupt”  the  most  significant  result  digit  on  overflow.  Since  the 
DTW  processor  is  looking  for  the  closest  match  (i.e.,  the  smallest  match  factor),  this 
result  will  not  be  used  anyway,  so  an  exact  result  is  not  required.  Overflow  can  be  inhi¬ 
bited  altogether  by  using  the  digit  mapping  shown  in  Table  3  for  digit  operation  (l)  in 
the  ADD  component,  but  only  for  the  most  significant  digit.  As  we  will  see,  the  arrival 
of  the  most  significant  digit  of  an  operand  coincides  with  a  wave  control  signal,  Win  , 
one  of  which  is  broadcast  along  each  back  diagonal  in  the  PE  array. 


Table  3.  Corrupted  MSP  ADD  Operation  111  Manning 
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Absolute  Value  Component 


The  digit  serial  absolute  value  component  is  defined  in  Figure  8. 


ABS 


101lt 


where  zout  =  j  xin  | 

Figure  8.  MSD  Digit  Serial  Absolute  Value 

The  functioning  of  the  absolute  value  component  depends  upon  the  sign  of  the  most 
significant  non-zero  digit.  Thus,  the  absolute  value  computation  is  an  operation  which 
requires  its  inputs  most  significant  digit  first.  If  the  most  significant  digit  is  positive,  no 
further  arithmetic  processing  on  the  operand  digits  are  required.  If  the  most  significant 
digit  is  negative,  both  that  digit  and  all  subsequent  (less  significant  digits)  will  need  to 
be  negated  (two’s  complemented).  The  three  state,  finite  state  machine  shown  in  Figure 
9  more  completely  defines  the  ABS  component.  In  the  diagram,  the  next  output  digit 
value  is  shown  in  parenthesis  below  the  state  transition  condition.  The  wave  control 
signal,  Win  ,  which  is  active  on  arrival  of  the  most  significant  digit  of  the  operands,  is 
used  to  initialize  the  finite  state  machine  to  state  A. 

xin  — 0  Win  — 1 


Figure  9.  Finite  State  Machine  for  ABS 


For  example,  when  Xin  —  12324  (  =  -  01024)  the  absolute  value  component 
would  output  =  12324  (  =  01024).  Although  the  result  digit  can  be  determined 
based  on  only  the  current  input  digit,  the  latency  of  the  ABS  component  is  defined  to 
be  one  to  be  consistent  with  the  other  primitive  components.  Thus,  the  result  digit  is 
held  until  the  next  clock  cycle  before  it  is  output.  The  ABS  primitive  component  has 


been  estimated  at  150X  by  150X  in  custom  logic,  double  metal  CMOS.  As  with  ADD, 
only  one  component  is  needed  regardless  of  the  operand  precision.  While  this  operation 
could  be  combined  with  the  subtract  component  to  build  an  absolute  value  subtractor, 
we  discuss  them  here  as  separate  components  for  simplicity. 


Minimum,  Component 

The  digit  serial  minimum  component  which  computes  the  minimum  of  two  input 
values  is  defined  in  Figure  10. 


MIN 


zout 


Vin 


where  zout  =  MIN  (xin  ,  yin  ) 


Figure  10.  MSD  Digit  Serial  Minimum 


Finding  the  minimum  of  two  operands,  an  operation  that  is  very  straightforward  when 
dealing  with  a  conventional  digit  set,  becomes^  more  difficult  when  using  a  signed-digit 
set.  Consider,  for  example,  when  Xin  =  12324  and  Fln  =  11214.  One’s  first  guess 
might  be  to  choose  Yin  as  the  minimum,  but  in  fact  Xin  is  the  actual  minimum  value. 
Recall  that  in  digit  serial  processing  with  a  latency  of  one,  the  hardware  can  only  "look 
ahead”  one  digit  position.  The  five  state,  finite  state  machine  [Nas]  given  in  Figure  11 
more  completely  defines  the  MIN  component.  The  wave  control  signal  Win  is  once 
again  used  to  initialize  the  finite  state  machine  to  state  A. 
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As  with  the  ABS  component,  the  latency  of  the  MIN  component  is  defined  to  be 
one.  Logic  equivalent  to  a  one  digit  adder  cell  is  needed  to  determine  both  the  next 
state  and  the  correct  output  digit.  The  MIN  primitive  component  has  been  estimated 
at  300X  by  200X  in  custom  logic,  double  metal  CMOS.  As  with  ADD  and  ABS,  only 
one  component  is  needed  regardless  of  the  operand  precision.  Three  MIN  components 
are  needed  to  find  the  minimum  of  three  values  as  required  in  the  DTW  PE. 

Using  the  finite  state  machine  in  Figure  11  two  msd,  base  4,  digit  serial  minimum 
examples  are  illustrated  in  Table  4.  The  first  MIN  operation  where  Xtn  =  12324  and 
Yin  ~  U2I4  outputs  Z01lt  =  11124,  a  numerically  equivalent,  recoded  version  of  X,n  , 
the  actual  minimum  input  value.  The  second  MIN  operation  where 
X\n  ~  12324  (  =  -01024)  and  F,n  =  12224  (=  -01124)  correctly  outputs  the  numerical¬ 
ly  minimum  value  Zout  =  12224. 
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Digit  Pipelined  DTW  Processing  Element 

Our  dynamic  time  warp  PE  is  built  out  of  the  primitive  arithmetic  components 
described  in  the  previous  section.  Recall  that  each  PE  in  the  DTW  processor  must  be 
able  to  compute 

si,j  =  I  u,  -Rj  |  +  minimuin(5,  _u  A-ij-iA  j-i) 

=  ADD(ABS(SUB(«,  r; ))  +  MIN(MIN(s(.I  ;.  ,s,  ^JMIN (•i.u.lt9iJ.l))) . 
The  internal  block  diagram  description  of  one  such  PE  is  shown  in  Figure  12. 
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Figure  12.  Block  Diagram  of  the  DTW  PE 

Each  PE  contains,  in  addition  to  three  MIN  components,  one  ADD  component,  one 
SUB  component,  and  one  ABS  component,  three  three-bit  registers  for  latching  the  di¬ 
git  values  of  rout ,  uout ,  and  s0#( .  A  wave  control  signal,  W-n  ,  is  broadcast  to  all  of  the 
PE’s  on  the  same  back  diagonal  and,  therefore,  is  not  latched  through  the  PE.  Using 
the  estimates  for  the  sizes  of  the  primitive  arithmetic  components  given  in  the  previous 
section  and  a  X  =  1.5  microns  for  3  micron  feature  size  devices,  the  size  of  the  PE  is  es¬ 
timated  at  under  1000  microns  by  1000  microns. 

The  clocking  scheme  proposed  for  the  PE  is  shown  in  Figure  13.  Three  two- 
phase  clock  pulses  are  required  to  completely  process  one  digit  through  the  PE. 
MIN(*,_U,  MIN(s,_,  and  SUB(u, ,  r; )  are  done  during  the  first 

cycle,  MiN(MIN(«,  y  ,  s,-,,  y_j),  MIN(s,  _,  ;  _j,  s,- ;  _j))  and  ABS(SUB(u, ,  r; ))  are 
done  during  the  second  cycle,  and  the  final  addition  is  done  during  the  third  cycle.  The 
wave  pulse,  \Vk ,  will  reoccur  with  the  arrival  of  the  most  significant  digit  of  the  next 
coefficient  or  every  24  clock  pulses  for  eight  digit  (16-bit)  operands. 


phil 
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♦  MIN,  SUB  -»  ♦MIN,  ABS  -*■  < — ADD * 


Figure  13.  PE  Clocking  Scheme 

Assuming  a  gate  delay  of  two  nanosecond,  each  clock  phase  should  be  active  for 
eight  nanoseconds  since  each  primitive  arithmetic  component  cell  contains  four  levels  of 
logic.  Referring,  for  example,  to  the  logic  in  the  ADD  component  shown  in  Figure  6, 
digit  operation  (1)  would  be  performed  during  phil  and  digit  operation  (2)  during  phi2 
of  the  appropriate  clock  cycles.  Allowing  one  nanosecond  phase  isolation,  the  PE  would 
require  54  nanoseconds  to  process  one  digit  which  we  refer  to  as  the  PE  cycle  time. 
Since  the  primitive  arithmetic  components  are  all  digit  pipelined,  the  next  digit  of  the 
operands  are  processed  by  this  PE  during  the  next  PE  cycle  while  the  results  of  this  PE 
are  passed  to  its  neighbors  for  processing. 


Digit  Pipelined  DTW  Processor 

Obviously,  the  entire  504  PE  DTW  processor  shown  in  Figure  4  will  not  fit  on 
one  VLSI  chip  in  today’s  technology.  A  partitioning  of  the  processor  into  several  chips, 
considering  both  chip  area  and  pin  limitations,  must  be  done.  Available  MOSIS  stan¬ 
dard  frames  [MOS]  for  a  single  VLSI  chip  which  are  viable  candidates  for  the  DTW  pro¬ 
cessor  are  shown  in  Table  4. 


Table  4.  Avail 

able  MOSIS  Frames  f3  m 

cron  feature  size! 

Technology 

Max.  Project  Size 
( microns) 

No.  of  Pins 
f arid  arravl 

mu 

4600  x  6800 

6900  x  6800 

7900  x  9200 

84 

84 

84 

Various  partitionings  of  the  DTW  into  chips  were  considered.  The  most  promis¬ 
ing  one,  shown  in  Figure  14,  contains  three  chips  types  all  of  which  fit  into  MOSIS  stan- 


dard  frames.  Chip  type  A  contains  a  complete  three  by  three  array  of  PE’s.  Chip  types 
B  and  C,  which  are  only  partially  populated  with  PE’s,  are  mirror  images  of  one  anoth¬ 
er. 


Ba*«rwe«  Diet'ontry  bora 

Figure  14.  DTW  Partitioning  (1=42) 

Using  this  partitioning,  40  chips  of  type  A,  12  of  type  B,  and  12  of  type  C  for  a  total  of 
64  chips  are  required  to  implement  the  full  DTW  processor  of  I  =  42  and  a  warp  factor 
of  six.  Table  5  gives  a  sampling  of  the  number  of  chips  which  would  be  required  for 
various  frame  counts  and  warp  factors.  For  example,  the  entire  1764  PE  mesh  would  be 
needed  to  process  connected  speech  [QGGj. 
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Not  only  are  the  PE’s  nearest  neighbor  interconnected  within  each  of  the  chip 
types,  but  the  chips  are  also  interconnected  in  a  nearest  neighbor  fashion  at  the  board 
or  wafer  level.  The  chip  counts  do  not  include  the  memory  chips  required  to  hold  the 
reference  dictionary  or  the  control  chip  required  to  generate  the  broadcast  wave  control 
and  clocking  signals.  For  42  frames,  84  wave  control  signals  are  required,  one  for  each 
back  diagonal  (with  a  maximum  of  10  active  at  one  time  for  eight  digit  operands). 

The  block  diagram  of  chip  type  A  is  shown  in  Figure  15  with  data  and  wave  con¬ 
trol  signals  indicated.  Globally  routed  reset,  power,  ground,  and  clock  lines  are  not 
shown. 
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Figure  15.  Block  Diagram  of  Chip  Type  A 
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Using  the  area  requirement  of  under  1000  microns  by  1000  microns  for  each  PE,  MOSIS 
standard  frame  6900  microns  by  6800  microns  will  easily  accommodate  the  three  by 
three  grid  of  PE’s  while  leaving  ample  room  for  global  routing  and  pads.  Nine  digit  re¬ 
gisters  are  needed  to  latch  the  data  between  adjacent  waves  of  operation. 

While  the  area  requirement  is  easily  achieved,  the  pin  limitation  is  not.  A  count 
of  the  data  requirements  for  chip  type  A  is  six  pins  for  inputting  r;  ,  r;+1,  and  r  •  +2,  six 
for  outputting  r;  ,  rJ  +1,  and  r;  +2,  six  for  inputting  it,  ,  u,+1,  and  u,  +2,  six  for  output¬ 
ting  u, ,  u,  +1,  and  u,  +2,  twenty-one  for  inputting  the  seven  s,„ ’s  and  fifteen  for  output¬ 
ting  the  five  sout ’s  for  a  total  of  60  data  lines.  In  addition,  six  wave  control  signals  are 
required  as  well  as  a  global  reset  signal,  two  clock  lines,  power,  ground  and  substrate  for 
a  total  of  12  control  lines.  Thus,  chip  type  A  uses  72  of  the  84  available  pins. 


Isolated  Word  Recognition  Rate 

The  recognition  rate  of  the  DTW  processor  can  now  be  determined.  The  PE  cy¬ 
cle  time,  the  time  for  a  PE  to  process  one  digit  of  a  coefficient,  is  54  nsec.  With  84 
waves  in  a  42  frame  mesh,  4.536  usee  are  required  to  produce  one  digit  of  a  match 
coefficient.  This  is  also  the  amount  of  time  which  must  be  allowed  for  mesh  fill  time. 
With  eight  digits  per  coefficients,  36.288  usee  are  required  to  process  each  coefficient.  If 
there  are  eight  coefficients  per  frame,  290.5  usee  of  processing  time  once  the  mesh  is  full 
are  required.  Since  the  DTW  is  pipelined  at  the  digit  level,  once  the  mesh  is  full  and 
the  most  significant  digit  of  the  first  match  factor  coefficient  is  output  by  the  DTW  pro¬ 
cessor,  a  digit  is  output  every  54  nsec.  Assuming  that  a  real-time  recognition  rate  re¬ 
quires  that  all  of  the  match  factors  for  the  entire  reference  dictionary  can  be  generated 
in  about  half  a  second,  our  DTW  processor  can  process  a  reference  dictionary  of  2000 
words  with  eight,  eight  digit  coefficients  per  frame  in  real-time.  Table  6  summarizes  the 
word  recognition  rate  of  the  42  frame  DTW  processor  for  various  numbers  of  coefficients 
per  frame  of  various  precisions.  The  only  impact  on  the  DTW  hardware  when  changing 
the  number  of  coefficients  and/or  their  precision,  is  assuring  that  the  control  chip  coor¬ 
dinates  the  broadcast  wave  signal  with  the  delivery  of  the  most  significant  digits  of  the 
unknown  utterance  and  the  reference  words  to  the  processing  element.  The  DTW  pro¬ 
cessor  design  does  not  change. 
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Tahlp  fl.  Word  F 

procrnition  Rates  for  2000  Words  fl=42l 

#  Coef. /Frame 

#  Dieits/Coef. 

Recoen.  Rate  (seel 

4 

3 

.109 

4 

8 

.290 

6 

8 

.435 

8 

8 

.581 

12 

8 

.871 

16 

8 

1.161 

16 

12 

1.742 

Conclusions 

A  custom  VLSI  design  of  a  DTW  processor  for  isolated  speech  recognition  has 
been  presented  which  is  real-time  for  dictionaries  containing  2000  words.  Our  design 
differs  from  previous  DTW  processors  in  that  (1)  the  data  is  represented  in  signed-digit, 
base  4  format;  (2)  the  digits  are  passed  between  processing  elements  in  a  most 
significant  digit  first,  digit  serial  fashion;  and  (3)  the  algorithms  are  pipelined  at  the  di¬ 
git  level  with  each  primitive  component  having  a  latency  of  one.  This  allows  a  very 
high  degree  of  concurrency  and  a  high  data  rate  to  be  maintained,  while  the  pin  out  re¬ 
quirements  are  kept  low.  VLSI  implementations  for  each  of  the  primitive  components 
needed  to  build  a  processing  element  were  presented.  A  partitioning  into  chips  of  a  42 
frame  DTW  processor  with  a  warp  factor  of  six  was  proposed.  Our  VLSI  DTW  proces¬ 
sor  is  both  flexible  and  modular.  The  design  is  independent  of  the  number  of 
coefficients  per  frame  and  the  precision  of  those  coefficients.  The  design  is  also  easily 
expandable  in  the  number  of  frames  per  utterance  and  the  warp  factor. 

The  ADD  primitive  component  has  been  fabricated  by  MOSIS,  while  the  other 
primitive  components  are  in  design.  Future  designs  will  use  a  1.2  micron  feature  size 
now  supported  by  MOSIS.  This  will  reduce  the  area  of  each  processing  element,  but 
still  no  more  than  nine  processing  elements  will  fit  on  one  chip  due  to  the  pin  con¬ 
straints,  Efforts  are  also  underway  to  combine  the  SUB  and  ABS  into  one  primitive 
component,  as  well  as  the  three  MIN  components.  This  will  allow  us  to  reduce  the  PE 
cycle  time  to  two  clock  cycles  (32  nsec.).  Primitive  components  to  provide  other  local 
distance  measures  are  also  being  investigated.  The  VLSI  design  effort  is  supported  by  a 
number  of  CAD  tools  under  development  at  Penn  State  (ref).  Ultimately,  these  tools 
will  provide  automatic  layout  of  architectures  like  the  DTW  processor  from  a  high-level 
algorithmic  description  of  the  architecture. 
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